We study global fluctuations of the guanine and cytosine base content (GC%) in mouse genomic
DNA using spectral analyses. Power spectra S(f) of GC% fluctuations in all nineteen autosomal and
two sex chromosomes are observed to have the universal functional form S(f) ~ 1/f^α (α ≈ 1) over
several orders of magnitude in the frequency range 10^-7 < f < 10^-5 cycle/base, corresponding to
long-ranging GC% correlations at distances between 100 kb and 10 Mb. S(f) for higher frequencies
(f > 10^-5 cycle/base) shows a flattened power-law function with α < 1 across all twenty-one
chromosomes. Substituting the interspersed repeats (about 38% of the genome) does not affect the
functional form of S(f), indicating that these repeats are not predominantly responsible for the
long-ranged multi-scale GC% fluctuations in mammalian genomes. Several biological implications of
the large-scale GC% fluctuations are discussed, including neutral evolutionary history by DNA
duplication, chromosomal bands, spatial distribution of transcription units (genes), replication
timing, and recombination hot spots.

1 Introduction



DNA sequences, the blueprint of almost all essential genetic information, are polymers
consisting of two complementary strands of four types of bases: adenine (A), cytosine (C),
guanine (G), and thymine (T). An A on one strand is always paired with a T on the opposite
strand, forming a "base pair" with 2 hydrogen bonds; similarly, G and C are complementary
to one another and form a base pair with 3 hydrogen bonds [?]. Consequently, one may
characterize AT base pairs as "weak" and GC base pairs as "strong". In addition, the
frequency of A (G) on a single strand is approximately equal to the frequency of T (C) on
the same strand, a phenomenon that has been termed "strand symmetry" [3] or "Chargaff's
second parity" [5]. Therefore, DNA sequences can be transformed into reduced 2-symbol
sequences of weak W (A or T) and strong S (G or C) bases. The percentage of S (G or C)
bases in a DNA sequence segment is denoted as the GC base content (GC%).

The spatial variation of GC% along a DNA sequence has long been of interest [?]. GC% series
can be considered as fluctuating or unsteady signals, and consequently many signal
processing and stochastic analysis techniques can be applied to characterize and quantify
the statistical properties of DNA sequences [10, 11, 12]. In particular, spectral and
correlation analyses are standard tools that can be applied [13]. Initial spectral
analyses [?] provided evidence that DNA sequences, especially non-protein-coding
sequences, exhibit a power spectrum S(f) that can be approximated by S(f) ~ 1/f^α (α ≈ 1);
such signals are termed "1/f noise" (or "1/f^α noise" with 0.5 < α < 1.5)
[17, 18, 19, 20, 21]. 1/f noise lies between white noise (α = 0) and Brownian noise
(α = 2) [22], and is indicative of a wide distribution of length scales (or time scales,
in the case of stochastic processes) [23].

The observation of 1/f^α spectra in many, but not all, short DNA sequences (of the
order of a few thousand bases) poses the question of whether 1/f^α spectra are a universal
characteristic of all DNA sequences. Several lines of evidence show that the 1/f^α
spectrum is indeed a generic phenomenon of GC% fluctuations in DNA sequences and
is found in genomic DNA sequences from different taxonomic classes, including genomes
from bacteria [?], yeast [?], insect [?], worm (W. Li, unpublished data), and human [?].
The human and mouse genomes are evolutionarily separated by about 65-75
million years, and they exhibit a high level of homology [30]. Yet several species-specific
differences exist that might lead to different functional properties of S(f):

• While the overall, genome-wide GC% of both human and mouse genomic DNA sequences
is about 42%, the distribution of GC% is different: GC% measured in 20 kb
(20 × 10^3 bases) windows in mouse genomic DNA lacks extremely high and low
GC% values [30].

• There exist pronounced differences between large sequence segments (of the order of
several Mb (10^6 bases)) of human and mouse chromosomes due to chromosomal
rearrangements. At such length scales, GC% correlations existing in the human genome
may be absent in the mouse genome.

This Letter examines the presence of 1/f^α spectra in spatial GC% variations across all
Mus musculus chromosomes. A graphic display of the mouse genome GC% fluctuation can
be found at [?].

2 Materials and Methods

2.1 DNA Sequence Data

We download mouse genomic DNA sequences for nineteen autosomal chromosomes (Chr1-Chr19)
and two sex chromosomes (ChrX and ChrY) from the UCSC Genome Bioinformatics Site
http://genome.ucsc.edu/ (the October 2003 release, or UCSC version mm04).
Each of the twenty-one chromosomes is evenly partitioned into 2^17 = 131,072 non-overlapping
windows of ω bases, and the GC% of each window is computed. A fraction of the bases is
as yet uncharacterized, and at these positions the symbol "N" stands in place of A, C, G,
or T (sequence gaps). Windows that contain only uncharacterized bases have an undetermined
GC% value. In this study, we replace all undetermined GC% values by values drawn at random
from a normal distribution whose mean and variance are taken from the empirical
distribution of all determined GC% windows.
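
The windowing and gap-filling steps described above can be sketched in Python as follows. This is a minimal illustration and not the original implementation: the function name gc_percent_windows and the use of NumPy are assumptions, as is the choice to compute GC% over the characterized bases only when a window is partially covered by "N".

```python
import numpy as np

def gc_percent_windows(seq, n_windows, rng=None):
    """Return (GC% per window, window size) for equal non-overlapping windows.

    Hypothetical helper for illustration; not the authors' code.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = len(seq) // n_windows          # window size in bases; any remainder is dropped
    if w == 0:
        raise ValueError("sequence is shorter than the number of windows")
    raw = seq[: w * n_windows].upper().encode("ascii")
    codes = np.frombuffer(raw, dtype=np.uint8).reshape(n_windows, w)

    is_gc = (codes == ord("G")) | (codes == ord("C"))
    is_known = is_gc | (codes == ord("A")) | (codes == ord("T"))
    known = is_known.sum(axis=1)

    gc = np.full(n_windows, np.nan)
    ok = known > 0                     # windows with at least one characterized base
    if not ok.any():
        raise ValueError("no window contains characterized bases")
    # Assumption: partially gapped windows use only their characterized bases.
    gc[ok] = 100.0 * is_gc.sum(axis=1)[ok] / known[ok]

    # Undetermined windows get random values drawn from a normal distribution
    # with the empirical mean and standard deviation of the determined windows.
    n_missing = int((~ok).sum())
    if n_missing:
        gc[~ok] = rng.normal(gc[ok].mean(), gc[ok].std(), size=n_missing)
    return gc, w
```

For a whole chromosome one would call, for example, gc_percent_windows(chromosome_sequence, 2**17), where chromosome_sequence is a placeholder for the sequence string read from the downloaded file.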



Figure 1: Genome-wide GC content (GC%) of the mouse Mus musculus for overall, non-repetitive, and
repetitive sequences.

Higher eukaryotic genomes are enriched in repetitive sequences [32]. Repeats are
approximate copies of DNA sequence segments, and interspersed repeats are an abundant
class of repetitive sequence segments that are scattered throughout mammalian genomes.
Transposon-derived interspersed repeats constitute about 35-45% [?] of the human genome
and about 38% [30] of the mouse genome.
As can be seen from Fig. 1, the distribution of GC% for repetitive sequences is markedly
different from that of the non-repetitive sequences. In order to study the effects of
interspersed repeats, we use interspersed repeat-annotated versions of the mouse chromosomes
and separately analyze GC% fluctuations obtained from DNA sequences with retained and
with substituted interspersed repeats. In the latter DNA sequences, we substitute the GC%
of interspersed repeats by random GC% values drawn from a normal distribution with mean
and variance taken from the empirical distribution of the non-repetitive portion of each
individual chromosome.

2.2 Spectral Analysis of DNA Sequences

We coarse-grain sequences into a spatial series of GC contents (GC%) and conduct
spectral analysis of the spatial GC% series. To this end, we choose a window of size ω
bases, compute GC%, move the window along the DNA sequence by Δω bases, and iterate
the computation to obtain a spatial series of GC% values. Non-overlapping windows are
obtained by setting Δω = ω.

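The power spectrum, the absolute squared average of the Fourier transform, is defined as

\[
S(f) = \frac{1}{N} \left| \sum_{k=1}^{N} \mathrm{GC\%}_k \, e^{-i 2\pi k f / N} \right|^{2}
\tag{1}
\]

where GC%_k is the GC content of the k-th window and N is the total number of windows.
Table 1 lists the window sizes and averaged GC% calculated at these window sizes for all
chromosomes.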
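
As an illustration of how such a power spectrum can be obtained from a GC% series, the following Python sketch computes a raw periodogram with NumPy's FFT. It is a sketch under stated assumptions rather than the procedure used in this study: the function name power_spectrum, the 1/N normalization, and the removal of the mean before transforming are choices made for the example.

```python
import numpy as np

def power_spectrum(gc, window_size):
    """Raw periodogram of a GC% series (hypothetical helper for illustration).

    gc          : GC% values of consecutive non-overlapping windows
    window_size : window size in bases, so frequencies come out in cycle/base
    """
    gc = np.asarray(gc, dtype=float)
    n = gc.size
    x = gc - gc.mean()                        # assumption: drop the mean (f = 0) term
    spec = np.abs(np.fft.rfft(x)) ** 2 / n    # squared modulus of the transform, over N
    freq = np.fft.rfftfreq(n, d=window_size)  # harmonic j maps to j / (N * window_size)
    return freq[1:], spec[1:]                 # skip the zero-frequency component
```

With N = 2^17 windows this returns 65,536 spectral components, and the corresponding length scale of each component is L = 1/f bases.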

Table 1: Window sizes (ω) and average GC contents (GC%). Each mouse chromosome (Chr) is partitioned
into 2^17 non-overlapping windows.

Chr   GC%   ω (kb)      Chr   GC%     ω (kb)
1     41    1.52        11    44      0.93
2     42    1.39        12    42      0.88
3     40    1.25        13    42      0.90
4     42    1.18        14    41      0.90
5     43    1.15        15    42      0.81
6     41    1.15        16    41      0.76
7     43    1.05        17    43      0.73
8     42    1.00        18    41      0.69
9     43    0.96        19    43      0.47
10    41    1.02        X/Y   39/39   1.21/0.17



3 Results 

We use the Fast Fourier Transform (FFT) as implemented in the S-PLUS statistical package
(Version 3.4, MathSoft, Inc.). The S-PLUS subroutine Spectrum takes a discrete FFT as
input and calculates a periodogram (the power spectrum in units of 10 log10 S(f)) as
output, and subsequently applies Daniell filtering (i.e., a rectangular window) [36, 37]
to compute a smoothed spectrum using a user-specified parameter value (span).
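
The Daniell (rectangular) smoothing step can be illustrated with a short Python sketch. It is not the S-PLUS routine: the function names, the handling of the spectrum edges by shrinking the averaging window, and smoothing S(f) on a linear rather than a decibel scale are all assumptions. The default index ranges and spans in smooth_by_range follow the values stated in the caption of Figure 2, mapped to zero-based indices.

```python
import numpy as np

def daniell_smooth(spec, span):
    """Running mean over `span` neighbouring spectral components (span must be odd)."""
    if span < 1 or span % 2 == 0:
        raise ValueError("span must be a positive odd integer")
    spec = np.asarray(spec, dtype=float)
    half = span // 2
    csum = np.concatenate(([0.0], np.cumsum(spec)))
    idx = np.arange(spec.size)
    lo = np.maximum(idx - half, 0)              # window is truncated at the edges
    hi = np.minimum(idx + half + 1, spec.size)
    return (csum[hi] - csum[lo]) / (hi - lo)

def smooth_by_range(spec, edges=(0, 10, 100, 500), spans=(1, 3, 31, 501)):
    """Apply a different span to each range of spectral components."""
    spec = np.asarray(spec, dtype=float)
    out = np.empty_like(spec)
    bounds = list(edges) + [spec.size]
    for start, stop, span in zip(bounds[:-1], bounds[1:], spans):
        start, stop = min(start, spec.size), min(stop, spec.size)
        # Smooth the full spectrum, then keep only the slice for this range.
        out[start:stop] = daniell_smooth(spec, span)[start:stop]
    return out
```

Using wider spans at higher component indices reduces the scatter where the periodogram has many closely spaced points on a logarithmic frequency axis, while leaving the few lowest-frequency components unsmoothed.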

Figure 2 shows the power spectrum S(f) as a function of the frequency f for the nineteen
autosomal and two sex chromosomes. For sequences with retained interspersed repeats, we
find that S(f) exhibits the functional form S(f) ~ 1/f^α consistently across all twenty-one
chromosomes. The exponent is α ≈ 1 in the frequency range 10^-7 < f < 10^-5 cycle/base,
corresponding to length scales L = 1/f of 100 kb < L < 10 Mb. At higher frequencies,
f > 10^-5 cycle/base (L < 100 kb), S(f) generally flattens, with α < 1 across all
chromosomes. This deviation from the 1/f spectrum was also observed in the human
genome [?]. At lower frequencies, f < 10^-7 cycle/base (L > 10 Mb), there are far fewer
spectral components; hence S(f) shows relatively larger fluctuations, and the estimation
of S(f) ~ 1/f^α is less reliable. In the frequency range 10^-8 < f < 10^-7 cycle/base,
only S(f) for Chr2, Chr4, Chr7, Chr11, and Chr16 is indicative of a persistence of α ≈ 1.
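
The text does not state how the exponent is estimated from the spectra. One simple possibility, shown here only as a hedged sketch, is an ordinary least-squares fit of log10 S(f) against log10 f within a chosen frequency band. The function name fit_alpha and the fitting method are assumptions made for this example.

```python
import numpy as np

def fit_alpha(freq, spec, f_min, f_max):
    """Estimate alpha in S(f) ~ 1/f^alpha from a log-log straight-line fit.

    Hypothetical helper; frequencies are in cycle/base, so the band
    1e-7 < f < 1e-5 corresponds to length scales L = 1/f from 10 Mb down to 100 kb.
    """
    freq = np.asarray(freq, dtype=float)
    spec = np.asarray(spec, dtype=float)
    band = (freq >= f_min) & (freq <= f_max) & (spec > 0)
    if band.sum() < 2:
        raise ValueError("need at least two spectral components in the band")
    slope, _intercept = np.polyfit(np.log10(freq[band]), np.log10(spec[band]), 1)
    return -slope    # S(f) ~ f^(-alpha), so alpha is minus the fitted slope
```

Because spectral components are evenly spaced in f rather than in log f, an unweighted fit of this kind is dominated by the high-frequency end of the band; fitting a smoothed or log-binned spectrum is a common way to reduce that effect.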

When we compute S(f) for mouse chromosomes 1, 2, ..., 19, X, and Y with substituted
interspersed repeats, we find that S(f) is higher than the S(f) obtained for the original
sequences, especially at frequencies f > 10^-5 cycle/base (L < 100 kb). One possible
explanation is that replacing the GC% of repetitive sequences by random GC% values
increases the level of white-noise fluctuations at length scales comparable to the
lengths of interspersed repeats.

It is interesting to note that the substitution of the interspersed repeats, about 38% of
the sequence, hardly affects S(f) at intermediate and lower frequencies. A similar
observation has been made for human genomic DNA sequences [?]. Thus, interspersed repeats
may not contribute predominantly to long-ranging correlations in mammalian genomic DNA.

4 Discussion 

4.1 Spatial 1/f spectra are not a generally held property

Before discussing the universal spectral shape of GC% of genomic DNA sequences, note
that not all spatial sequences or signals exhibit 1/f^α (α ≈ 1) power spectra.

An instructive example is provided by the spatial spectrum of images of natural
scenes, where it is known that the amplitude spectrum of image pixels is typically 1/f, and
consequently the power spectrum is S(f) ~ 1/f^2 [?]. Sometimes the exponent α
in S(f) ~ 1/f^α is not exactly equal to 2: for example, it was shown that underwater
images tend to exhibit a larger exponent α (a steeper slope) than atmospheric
images [?]. There are mainly two theories of the 1/f^2 scaling in such images: (i) it is
caused by luminance edges, and (ii) it is caused by a power-law distribution of sizes of
regions with constant intensity [?]. Experiments have been carried out to test whether
a change of the slope α in images can be detected visually by human subjects [44, 45].

This well-established 1/f^2 spatial power spectrum in images provides an example showing
that spatial power spectra are not necessarily of the form S(f) ~ 1/f. Rather, S(f) ~ 1/f
and 1/f^2 spectra are considered to belong to two different classes [22], and so the exponent
α ≈ 1 observed in DNA sequences is not a priori expected.

4.2 Spatial 1/f spectra are consistent with the evolutionary expansion-modification 
model 

One hypothesis is that 1/f^α (α ≈ 1) constitutes a universal property of all long DNA
sequences subject to neutral evolution involving duplications and mutations [?]. A
simplified model, termed the "expansion-modification (EM) model" [?], generates a binary
2-symbol sequence by two local operations: (i) expansion/duplication: 0 → 00, 1 → 11;
and (ii) modification/mutation: 0 → 1, 1 → 0. When the probability of the first operation
is large (e.g., probability p1 = 0.9), the resulting binary sequences exhibit 1/f^α (α ≈ 1)
power spectra. If the probability of the second operation (p2) is large, the resulting
sequences exhibit white spectra (α ≈ 0) [?]. Since the sequence-generating process is
hierarchical, it is implicit that the resulting sequences exhibit scale invariance (or
perhaps multiple-scale invariance [48]).
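
The two operations of the EM model can be simulated in a few lines of Python. The sketch below is an illustration under assumptions, not a reference implementation of the published model: the function name em_sequence is hypothetical, every symbol is updated synchronously in each generation, and the two probabilities are taken to be complementary (p2 = 1 - p1).

```python
import numpy as np

def em_sequence(p_expand, min_length, rng=None, seed_symbol=0):
    """Grow a binary sequence by repeated expansion and modification.

    In each generation every symbol is either duplicated (0 -> 00, 1 -> 11)
    with probability p_expand, or flipped (0 -> 1, 1 -> 0) otherwise.
    """
    if not 0.0 < p_expand <= 1.0:
        raise ValueError("p_expand must lie in (0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    seq = np.array([seed_symbol], dtype=np.int8)
    while seq.size < min_length:
        expand = rng.random(seq.size) < p_expand        # choose the operation per symbol
        symbols = np.where(expand, seq, 1 - seq)        # flip the symbols that are modified
        seq = np.repeat(symbols, np.where(expand, 2, 1))  # duplicate the ones that expand
    return seq
```

A sequence generated with, for example, em_sequence(0.9, 2**17) can then be passed to any periodogram routine to inspect how the spectral exponent depends on p_expand.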

The EM model contains two features that are essential for DNA evolution: duplication
and mutation. Duplications, both inter-chromosomal and intra-chromosomal, expand DNA
sequences and provide a potential for genes to develop novel functions [?]. In one point of
view, duplications have a larger impact than natural selection in Darwin's theory of
evolution, as duplications and the resulting redundancy actually created the foundation
upon which natural selection acts [?]. Although point mutations might be detrimental to
biological fitness, they nevertheless provide a potential for evolution, perhaps on a
smaller scale than duplications.

A more realistic model of the neutral evolution of DNA sequences by duplication and
mutation beyond the EM model is still lacking [51], and so the hypothesis that all long
DNA sequences undergoing duplications and mutations exhibit 1/f^α (α ≈ 1) power spectra
remains to be validated. The result presented in this paper, that Mus musculus genomic
DNA exhibits α ≈ 1, adds another line of evidence in support of this hypothesis.

4.3 High level of chromosomal segment translocations did not destroy 1/f 
spectra in mouse genomic DNA 

About 90% of the human and 93% of the mouse genome reside within syntenic blocks, in
which the order of a series of biological markers (e.g., genes) is approximately conserved
[30]. However, with the exception of ChrX, these syntenic blocks have different loci on
human and mouse chromosomes. About 65-75 million years ago, the human genome and
the mouse genome embarked on different evolutionary histories, with many chromosomal
translocations that left the syntenic blocks of a human chromosome scattered over different
mouse chromosomes.

This basic picture indicates that the observation of 1/f spectra in the human genome
does not guarantee similar spectra in the mouse genome, as random translocations can
easily destroy any long-range order in a given DNA sequence. An alternative explanation of
1/f spectra in the mouse genome, in light of the observed 1/f spectra in the human genome,
is that both are in fact a consequence of large-scale dynamics, such as translocation and
duplication. In principle, DNA sequences frozen about 65-75 million years ago, before
the divergence of the human and mouse species, could test this hypothesis.

Besides translocation and duplication, mutations may also affect GC% fluctuations. If
chromosomal segment translocations were the only process in evolution, the human and
mouse genomes ought to have the same GC content distribution. Nevertheless, the GC%
distribution in the mouse genome is tighter (with smaller variance) and lacks the segments
with higher GC% ("outliers") found in the human genome, a fact observed both
experimentally [?] and by sequence analysis [?].

One might think that this difference between the variances of the GC% distributions of
the human and the mouse genome would render their spatial spectra different. However,
as pointed out in [53, 54], the exponent α in 1/f^α is related to how the variance of GC%
changes with the window size, rather than to the variance itself (and the isochore, a
fairly homogeneous sequence segment about 200-300 kb long or more [?], is related to the
asymmetry or third moment of the GC% distribution).
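
The dependence of the GC% variance on window size can be examined numerically. The sketch below, using the hypothetical helper variance_by_scale, merges adjacent equal-sized windows by averaging and reports the variance at each merged size. As a general property of long-memory series (not a result taken from this text), a 1/f^α spectrum with 0 < α < 1 is expected to give a variance that decays roughly as the window size raised to the power α - 1, whereas uncorrelated data decay as the inverse of the window size.

```python
import numpy as np

def variance_by_scale(gc, factors=(1, 2, 4, 8, 16, 32)):
    """Variance of GC% after merging `m` adjacent windows, for each m in `factors`.

    Averaging the GC% of equal-sized neighbouring windows is used as a stand-in
    for recomputing GC% with a window m times larger (an assumption that holds
    exactly only when the windows contain no uncharacterized bases).
    """
    gc = np.asarray(gc, dtype=float)
    result = {}
    for m in factors:
        n_blocks = gc.size // m
        if n_blocks < 2:
            break                      # too few merged windows to estimate a variance
        merged = gc[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        result[m] = merged.var()
    return result
```

Plotting the returned variances against m on double logarithmic axes gives a quick visual check of whether two genomes with different overall variances nevertheless share the same scaling.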

4.4 GC content correlates with many biological features and 1/f spectra of 
GC% imply a scale-invariant distribution of these features 

Although variations in GC% are not of direct biological relevance, they are correlated
with many other measurements and biological functions of DNA sequences, including
chromosome bands, protein-coding gene density, replication timing, and recombination hot
spots [?]. If the correlation between GC% and a biological function is strong, then
the scale-invariant pattern in GC% can be transformed into a similar spatial structure of
these biologically functional units.

While a stretched-out chromosome is too thin to be visible (the diameter of a DNA
thread is about 2 × 10^-9 m [3]), during the metaphase of a cell cycle it becomes visible
because it is tightly packed into a chromatin structure (a typical human chromosome in
this compact form has a length of about 10^-5 m [3]). Staining chromosomes with Giemsa
dyes leads to alternating dark and light bands [?]. This difference in staining between
chromosomal regions is thought to be caused by the degree of condensation of the
chromatin structure [?]. The connection between GC% and Giemsa bands
was suggested long ago: Giemsa-light bands (termed R-bands) are GC-rich, whereas
Giemsa-dark bands (G-bands) are GC-poor [?]. This connection has been further
refined, such that being relatively GC-rich or GC-poor as compared to the flanking
regions is correlated with Giemsa bands [?]. This new proposal manages to reproduce
chromosome bands quite well by sequence analysis [61].

After the sequencing of the human genome was completed, it was revealed that
many long GC-poor regions have a low gene density ("gene deserts") [?], confirming
the earlier proposed correlation between GC% and gene density [?]. In fact, the
gene density not only increases with GC%, but increases faster than linearly
[57]. Extremely GC-rich regions in the human genome are also extremely gene-rich, and
beyond a GC content of GC% ≈ 46% the gene density increases markedly.

Due to the large genome size of most eukaryotic species, the replication of DNA
sequences starts at multiple positions. Some chromosome regions replicate earlier and
others later, with a clear boundary between the two types of regions. While a correlation
between replication timing and chromosome bands was proposed in [?] (R-bands replicate
earlier), its extent and biological relevance are still under investigation [?].

Furthermore, regions with a high recombination rate (recombination hot spots) have been
associated with being GC-rich [66]. While this correlation has been established for the
yeast Saccharomyces cerevisiae genome sequence [?], no conclusive results have yet been
obtained for higher genomes such as human genomic DNA [?]. The general difficulty in
determining regional recombination rates in the human genome is that the measurement is
either indirect (using pedigree analysis [?]) or limited to male samples (using sperm
typing [71]). Even a newly proposed population genealogy-based inference of recombination
rates [?] is still not a direct measurement. In addition, it has also been proposed that
recombination events increase the local GC content [?]. Although this switches the roles
of cause and effect, from a statistical (correlation) point of view the outcome remains
the same.

Instead of using the GC% as a "surrogate", there have also been attempts to study
large-scale patterns of biological units directly. It has been shown that in circular bacterial
genome sequences, the positions and orientations of genes do not have any preferred length
scale and are scale-invariant [75]. This result should be directly linked to a similar
scale invariance of GC% in bacterial genomes. The universally observed 1/f^α (α ≈ 1)
spectra in the mouse Mus musculus chromosomes, as well as in human chromosomes [?],
motivate further sequence analysis of the spatial arrangement of functional biological units.

Figure 2: Double logarithmic representation of the power spectrum 10 log10 S(f) of GC% fluctuations
across nineteen autosomal (Chr1-Chr19) and two sex (ChrX and ChrY) mouse chromosomes. The
functional form of S(f) can be approximated by S(f) ~ 1/f^α over several orders of magnitude in
frequency (α = 1 is plotted for comparison). The two curves represent S(f) obtained for DNA sequences
with retained and substituted interspersed repeats, respectively. For clear representation, S(f) is
smoothed at different frequency ranges, using a Daniell (rectangular) filter with sizes (span
parameter) 1, 3, 31, and 501 for the 1-10, 10-100, 100-500, and 500-65,536 spectral components.
Vertical lines mark the length scales L = 1/f (base/cycle) for L = 10 Mb, L = 1 Mb, and L = 100 kb.



